Method and apparatus for compressing data of storage system, device, and readable storage medium

ABSTRACT

In a method of storing a data block, a storage device has stored a plurality of data block groups, each data block in a data block group having a common part that is contained in another data block in that group. For a target data block to be stored, the storage device selects from the data block groups a target data block group that has one data block whose common part is identical to a part of the target data block. The storage device then saves the target data block by storing a target reference block of the target data block group and differential data between the target data block and the target reference block.

CROSS-REFERENCE TO RELATED APPLICATIONS

This is a continuation of International Patent Application No. PCT/CN2019/097144, filed on Jul. 22, 2019. The disclosure of the aforementioned application is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

This application relates to the field of storage technologies, and in particular, to a method and an apparatus for compressing data of a storage system, a device, and a readable storage medium.

BACKGROUND

With rapid development of big data, cloud computing, and artificial intelligence, enterprises have an explosive growth in data storage requirements. If data is directly stored, relatively large storage space is occupied, and costs are relatively high. To improve utilization of storage space, a data reduction technology is usually used to compress data.

In a related technology, a deduplication technology is generally used to improve the utilization of storage space. To be specific, a file is divided into data blocks of a same size, and a deduplication fingerprint of each data block is calculated. Because a same deduplication fingerprint indicates that content of data blocks is the same, data blocks with a same deduplication fingerprint can be stored only once.

When the deduplication technology is used, redundant data can be deleted only when content of data blocks is the same. During actual data storage, however, there is a low probability that there are data blocks that are completely the same. Therefore, a data reduction effect is poor.

SUMMARY

Embodiments of this application provide a method and an apparatus for compressing data of a storage system, a device, and a readable storage medium, to overcome a problem of a poor data reduction effect in a related technology.

According to an aspect, this application provides a method for compressing data of a storage system. The method includes: determining whether deduplication can be performed on a target data block; when deduplication cannot be performed on the target data block, obtaining a similar fingerprint of the target data block; determining, based on the similar fingerprint, a combined data block group to which the target data block belongs; and performing similar compression on the target data block based on a reference block in the combined data block group.

In a solution shown in this embodiment of this application, when the storage system stores data blocks in batches, the storage system determines whether deduplication can be performed on a target data block that refers to any one of the data blocks. When the storage system determines that deduplication cannot be performed on the target data block, the storage system may obtain the similar fingerprint of the target data block. The storage system may determine the similar fingerprint of the target data block before determining whether deduplication can be performed on the target data block, or when determining that deduplication cannot be performed on the target data block. A determining manner may be: splitting the target data block into equal-sized data units, and separately inputting each data unit into a preset hash function, to obtain an output result, namely, the similar fingerprint of the target data block. It can be learned that the similar fingerprint of the target data block is not one numeric value, but includes a group of numeric values.

After obtaining the similar fingerprint of the target data block, the storage system may determine the combined data block group to which the target data block belongs. Data blocks included in the combined data block group may be compressed together. The storage system may further determine the reference block in the combined data block group. If the target data block is not the reference block in the combined data block group, the storage system may determine differential data between the target data block and the reference block, and compress the differential data. If the target data block is the reference block in the combined data block group, the storage system may compress the target data block in a conventional compression manner, and perform similar compression on the other data blocks in the combined data block group in a same manner as the target data block. In this way, similar compression and deduplication are combined. When deduplication cannot be performed, similar compression can be used to further compress some data, to improve a reduction rate.
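The similar compression described above can be illustrated with a minimal sketch. The application does not specify how the differential data is computed or compressed; the byte-wise XOR delta, the use of zlib, and the function names `xor_delta`, `compress_similar`, and `restore_similar` are assumptions made only for illustration.

```python
import zlib


def xor_delta(block: bytes, reference: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks (assumed delta encoding).

    Similar blocks produce a delta that is mostly zero bytes.
    """
    if len(block) != len(reference):
        raise ValueError("blocks must have the same length")
    return bytes(b ^ r for b, r in zip(block, reference))


def compress_similar(target: bytes, reference: bytes) -> bytes:
    """Compress only the differential data between target and reference."""
    return zlib.compress(xor_delta(target, reference))


def restore_similar(compressed_delta: bytes, reference: bytes) -> bytes:
    """Rebuild the target block from the reference and the stored delta."""
    return xor_delta(zlib.decompress(compressed_delta), reference)
```

Because XOR is its own inverse, applying the decompressed delta to the reference block reproduces the target data block, so the reference block must be readable whenever a similar block is restored.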

In a possible implementation, the determining whether deduplication can be performed on a target data block includes: generating a deduplication fingerprint of the target data block; and querying whether the storage system has a fingerprint the same as the deduplication fingerprint, to determine whether deduplication can be performed on the target data block.

In the solution shown in this embodiment of this application, a deduplication fingerprint table is recorded in the storage system. The deduplication fingerprint table includes a deduplication fingerprint of a data block that is compressed and stored, a deduplication fingerprint of a received data block that is not compressed, and metadata information of the corresponding data block. The storage system may input the target data block into a fingerprint extraction function, to obtain the deduplication fingerprint of the target data block as an output result. The fingerprint extraction function may be a hash function. Then, the storage system determines, in the deduplication fingerprint table by using the deduplication fingerprint, whether the same deduplication fingerprint is recorded for a received data block that is not compressed. If the same deduplication fingerprint is recorded, it may indicate that a same data block exists. In this case, deduplication can be performed on the target data block. If the same deduplication fingerprint is not recorded, it may indicate that no data block the same as the target data block exists. In this case, deduplication cannot be performed on the target data block, the target data block needs to be directly compressed in a conventional manner (for example, Huffman encoding), and a compressed data block is stored. In this way, whether deduplication can be performed on the target data block can be accurately determined.

In a possible implementation, the determining whether deduplication can be performed on a target data block includes: determining a load of the storage system to determine whether deduplication can be performed on the target data block.

In the solution shown in this embodiment of this application, the load of the storage system directly affects storage efficiency of a data block. The storage system may determine the load of the storage system, and determine whether the load meets a load exceeding condition. The load may be reflected by a central processing unit (CPU) usage, a storage space usage, and a current time period. If the load meets the load exceeding condition, the storage system performs deduplication to improve processing efficiency of the storage system. If the load does not meet the load exceeding condition, the storage system has high processing efficiency and does not perform deduplication. In this way, the processing efficiency of the storage system can be improved by determining the load.

In a possible implementation, the method further includes: consecutively storing, in a same storage block, compressed data obtained after similar compression is performed on the target data block, and compressed data of another data block in the combined data block group.

In the solution shown in this embodiment of this application, when compressed data obtained after similar compression is performed on the target data block is stored, a storage block in which the compressed data of the other data block in the combined data block group is stored and a storage location of the compressed data in the storage block may be determined. Then, the compressed data obtained after similar compression is performed on the target data block and the compressed data of the other data block in the combined data block group are consecutively stored together. In this way, during data reading, differential data and data of the reference block can be read at a time. This can improve data reading efficiency.

In a possible implementation, if there are a plurality of data blocks other than the reference block in the combined data block group, compressed data of m data blocks is before the data of the reference block, and compressed data of n data blocks is after the reference block, where a difference between m and n is equal to any one of 0, 1, or −1, and both m and n are greater than or equal to 1.

In the solution shown in this embodiment of this application, if there are a plurality of data blocks other than the reference block in the combined data block group, assuming that there are m+n data blocks other than the reference block, the compressed data of m data blocks may be set before the data of the reference block, and the compressed data of n data blocks may be set after the data of the reference block. If m+n is an odd number, a relationship between m and n may be that m−n is equal to 1 or −1. If m+n is an even number, a relationship between m and n may be that m−n is equal to 0. In this way, during data reading, if differential data of a data block after the reference block needs to be read, the reading may directly start from the reference block and continue until the differential data of the data block is read, without a need to read differential data of all data blocks. If differential data of a data block before the reference block needs to be read, the reading may directly start from the differential data of the data block and end after the data of the reference block is read, without a need to read all the data. Therefore, less data is read, and reading efficiency is improved.
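The placement of the reference block between the other data blocks can be sketched as follows. The function names `place_around_reference` and `span_to_read`, and the use of list positions to stand for storage locations, are hypothetical and are used only to illustrate the m/n split and the resulting read range.

```python
def place_around_reference(reference_id, other_ids):
    """Put m blocks before the reference and n after it, with |m - n| <= 1.

    Returns the storage order and the position of the reference block.
    """
    m = len(other_ids) // 2
    layout = other_ids[:m] + [reference_id] + other_ids[m:]
    return layout, m


def span_to_read(reference_index, block_index):
    """Inclusive range of positions to read in order to restore one block.

    The read covers only the entries between the requested block and the
    reference block, never the whole group.
    """
    return min(reference_index, block_index), max(reference_index, block_index)
```

With five data blocks other than the reference block, this sketch places two before and three after the reference block, so m−n is −1, which matches the odd case described above.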

In a possible implementation, the determining, based on the similar fingerprint, a combined data block group to which the target data block belongs includes: determining, based on a similar fingerprint quantity, a data block group corresponding to the target data block, where the similar fingerprint quantity is a quantity of same similar fingerprints in any two data blocks in one data block group; and forming, in the data block group corresponding to the target data block, a first quantity of data blocks that have a same target fingerprint as the target data block into the combined data block group to which the target data block belongs, where a data amount of differential data between the target data block and a data block that has the target fingerprint is less than a data amount of differential data between the target data block and a data block that does not have the target fingerprint.

In the solution shown in this embodiment of this application, a similar fingerprint table is established in the storage system. The similar fingerprint table includes a correspondence between each similar fingerprint and metadata information of a data block. The storage system may determine, based on the similar fingerprint table, an uncompressed data block corresponding to each fingerprint in the similar fingerprint of the target data block. Then, data blocks having a similar fingerprint quantity of same fingerprints are grouped into one group by using the similar fingerprint quantity. In this way, the data block group corresponding to the target data block can be obtained. In the data block group corresponding to the target data block, data blocks that have a same target fingerprint as the target data block and that do not form a combined data block group with another data block may be successively selected from each data block group, to form the combined data block group to which the target data block belongs. A combined data block group to which any data block belongs may be determined in this manner. Because the data amount of the differential data between the target data block and the data block that has the target fingerprint is less than the data amount of the differential data between the target data block and the data block that does not have the target fingerprint, the data block that has the target fingerprint and the target data block are selected to form a combined data block group, and are compressed together. This can improve the reduction rate.

In a possible implementation, the determining, based on the similar fingerprint, a combined data block group to which the target data block belongs includes: determining, based on a similar fingerprint quantity, a data block group corresponding to the target data block, where the similar fingerprint quantity is a quantity of same similar fingerprints in any two data blocks in one data block group; determining a quantity of same similar fingerprints in the target data block and in a data block in each data block group; and forming, in the data block group corresponding to the target data block, a first quantity of data blocks that have a maximum quantity of same similar fingerprints as the target data block into the combined data block group to which the target data block belongs.

In the solution shown in this embodiment of this application, a similar fingerprint table is established in the storage system. The similar fingerprint table includes a correspondence between each similar fingerprint and metadata information of a data block. The storage system may determine, based on the similar fingerprint table, an uncompressed data block corresponding to each fingerprint in the similar fingerprint of the target data block. Then, data blocks having a similar fingerprint quantity of same fingerprints are grouped into one group by using the similar fingerprint quantity. In this way, the data block group corresponding to the target data block can be obtained. For the target data block, in the data block group corresponding to the target data block, data blocks that do not form a combined data block group with another data block are determined, a quantity of same similar fingerprints in the data blocks and in the target data block is determined, and then the data blocks are arranged in descending order. The first quantity of the data blocks are consecutively selected from the beginning in sequence, to form the combined data block group to which the target data block belongs. In this way, if data blocks have more same similar fingerprints, it indicates that data blocks are more similar. Data blocks having a relatively large quantity of same similar fingerprints may be selected to form the combined data block group, so that a data reduction rate can be improved.
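The selection by maximum quantity of same similar fingerprints can be sketched as follows. The function name `build_combined_group`, the representation of a similar fingerprint as a set of values, and the exclusion of blocks that share no fingerprint are assumptions for illustration; `candidates` is assumed to contain only data blocks that do not yet belong to a combined data block group.

```python
def build_combined_group(target_fps, candidates, first_quantity):
    """Pick up to first_quantity candidates sharing the most fingerprints.

    target_fps: set of fingerprint values of the target data block.
    candidates: dict mapping a block identifier to its set of fingerprints.
    """
    # Arrange candidates in descending order of shared fingerprint count.
    ranked = sorted(
        candidates.items(),
        key=lambda item: len(target_fps & item[1]),
        reverse=True,
    )
    # Keep the leading entries, skipping blocks with nothing in common.
    return [bid for bid, fps in ranked[:first_quantity] if target_fps & fps]
```

The returned identifiers, together with the target data block, would form the combined data block group in this sketch.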

According to an aspect, an apparatus for compressing data of a storage system is provided. The apparatus includes one or more modules, configured to perform the foregoing method for compressing data of a storage system.

According to an aspect, a storage device is provided. The storage device includes an interface and a processor. The interface and the processor cooperate to perform the foregoing method for compressing data of a storage system.

According to an aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores an instruction, and when the instruction is run on a storage system, the storage system is enabled to perform the foregoing method for compressing data of a storage system.

According to an aspect, a computer program product including an instruction is provided. When the computer program product runs on a storage system, the storage system is enabled to perform the foregoing method for compressing data of a storage system.

The technical solutions provided in this application include at least the following beneficial effects:

In the embodiments of this application, when a data block is stored, whether deduplication can be performed on a target data block is determined; when deduplication cannot be performed on the target data block, a similar fingerprint of the target data block is obtained; a combined data block group to which the target data block belongs is determined based on the similar fingerprint; and similar compression is performed on the target data block based on a reference block in the combined data block group. In this way, similar compression and deduplication are combined. When deduplication cannot be performed, similar compression can be used to further compress some data, to improve a reduction rate.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is an architectural diagram of a storage system according to an example embodiment of this application;

FIG. 2 is a structural diagram of a storage system according to an example embodiment of this application;

FIG. 3 is a flowchart of a method for compressing data of a storage system according to an example embodiment of this application;

FIG. 4 is a schematic diagram of storage of a data block according to an example embodiment of this application;

FIG. 5 is a schematic diagram of storage of a data block according to an example embodiment of this application;

FIG. 6 is a schematic diagram of storage of a storage block according to an example embodiment of this application;

FIG. 7 is a schematic diagram of a compressed block according to an example embodiment of this application;

FIG. 8 is a flowchart of a method for reading data according to an example embodiment of this application; and

FIG. 9 is a schematic diagram of a structure of an apparatus for compressing data of a storage system according to an example embodiment of this application.

DESCRIPTION OF EMBODIMENTS

To make the objectives, technical solutions, and advantages of this application clearer, the following further describes the implementations of this application in detail with reference to the accompanying drawings.

To facilitate understanding of the embodiments of this application, the following first describes a system architecture and concepts of nouns in the embodiments of this application.

The embodiments of this application are applicable to a storage system in the storage field. The storage system may be a server with a storage function, a server cluster with a storage function, a storage array, a distributed storage system, or the like. An architecture of the storage system may be shown in FIG. 1. The storage system may include a space management layer, a data management layer, and an underlying storage layer. The space management layer may include a plurality of execution modules. The underlying storage layer may also include a plurality of execution modules. The space management layer may be configured to connect to an upper layer, receive data, and send the data to the data management layer. The data management layer may be configured to compress an input data block to obtain a compressed data block, and send the compressed data block to the underlying storage layer for storage. For example, an input of the data management layer is data blocks A1, B, A2, . . . , X, and B, and an output of the data management layer is data blocks A1, A2, B, . . . , and X. Because only one B is stored, the amount of stored data is reduced.

Compression: A compression technology can be classified into lossless compression and lossy compression. Lossless compression means that, after compressed data is decompressed, the obtained data is the same as the original data. The storage system mainly uses lossless compression algorithms, such as Huffman encoding, Lempel-Ziv-Welch (LZW), and DEFLATE. Lossy compression means that, after compressed data is decompressed, the obtained data is different from the original data. Lossy compression is mainly applicable to the field of image or video compression.

Deduplication: Same files or data blocks in a distributed storage system are eliminated, to effectively reduce physical storage space occupied by data. This technology can be used in storage backup and archiving systems. Generally, a file is divided into a plurality of data blocks, and a deduplication fingerprint of each data block is calculated. Same fingerprints indicate that data blocks have same content. Therefore, original data can be stored only once for data blocks with same fingerprints, to reduce a data amount.

An embodiment of this application provides a method for compressing data of a storage system. The method may be performed by the storage system.

FIG. 2 is a block diagram of a structure of a storage system according to an embodiment of this application. The storage system may include at least an interface 201 and a processor 202. The interface 201 may be configured to receive data. In a specific implementation, the interface 201 may be a hardware interface, for example, a network interface card (NIC) or a host bus adapter (HBA), or may be a program interface module. The processor 202 may be a combination of a central processing unit and a memory, or may be a field programmable gate array (FPGA) or other hardware. The processor 202 may alternatively be a combination of a central processing unit and other hardware, for example, a combination of the central processing unit and an FPGA. The processor 202 may be a control center of the storage system, and is connected to all parts of the entire storage system through various interfaces and lines. In a possible implementation, the processor 202 may include one or more processing cores. The storage system further includes a hard disk, configured to provide storage space for the storage system.

An embodiment of this application provides a method for compressing data of a storage system. As shown in FIG. 3, an execution procedure of the method may include the following steps.

Step 301: Determine whether deduplication can be performed on a target data block.

During implementation, after the storage system is online, if an upper-layer application needs to store data, the upper-layer application may send the data to the storage system. The storage system receives the data. If a data amount of the data is relatively large, the storage system may divide the data into data blocks, and a size of each data block may be 4 KB, 8 KB, or another value. If the data amount of the data is less than a data amount of one data block, the data may be directly determined as a data block. Then, the storage system may periodically process the data blocks, or process the data blocks in batches when a data amount of received data blocks reaches a specific value. Any data block in the data blocks processed in batches this time may be the target data block, and whether deduplication can be performed on the target data block may be determined based on a current status of the storage system, whether the storage system stores a deduplication fingerprint of the target data block, or the like.
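The division of received data into data blocks can be sketched as follows. The function name `split_into_blocks` and the default block size of 8 KB are assumptions; the text does not state how a trailing remainder shorter than one block is handled, so this sketch simply keeps it as a shorter final block.

```python
from typing import List


def split_into_blocks(data: bytes, block_size: int = 8192) -> List[bytes]:
    """Divide data into fixed-size blocks (4 KB and 8 KB are example sizes).

    Data no larger than one block is treated as a single data block.
    """
    if len(data) <= block_size:
        return [data]
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]
```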

In the step 301, there are a plurality of manners of determining whether deduplication can be performed on the target data block. The following provides two feasible implementations.

Manner 1: Generate the deduplication fingerprint of the data block, and query whether the storage system has a fingerprint that is the same as the deduplication fingerprint, to determine whether deduplication can be performed on the data block.

During implementation, a deduplication fingerprint table is recorded in the storage system. The deduplication fingerprint table includes a deduplication fingerprint of a data block that is compressed and stored, a deduplication fingerprint of a received data block that is not compressed, and metadata information of the corresponding data block. Each time after the storage system determines the deduplication fingerprint of the data block, the storage system correspondingly adds the deduplication fingerprint and corresponding metadata information to the deduplication fingerprint table. The metadata information of the data block includes an identifier indicating whether the data block is a reference block or a duplicate block (if this has not yet been determined, the field may be left unfilled and is filled in after it is determined), a storage location (if the data block is stored, the data block has a storage location), and the like.

The storage system may input the target data block into a fingerprint extraction function, to obtain the deduplication fingerprint of the target data block as an output result. The fingerprint extraction function may be a hash function. Then, the storage system determines, in the deduplication fingerprint table by using the deduplication fingerprint, whether the same deduplication fingerprint is recorded for a received data block that is not compressed. If the same deduplication fingerprint is recorded, it may indicate that a same data block exists. In this case, deduplication can be performed on the target data block. If the same deduplication fingerprint is not recorded, it may indicate that no data block the same as the target data block exists. In this case, deduplication cannot be performed on the target data block.
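Manner 1 can be sketched as a lookup in an in-memory table. The class name `DedupFingerprintTable`, the choice of SHA-256 as the fingerprint extraction function, and the metadata fields `role` and `location` are assumptions for illustration; the text only states that the function may be a hash function.

```python
import hashlib


class DedupFingerprintTable:
    """Maps a deduplication fingerprint to metadata of the first such block."""

    def __init__(self):
        self._table = {}

    @staticmethod
    def fingerprint(block: bytes) -> str:
        # SHA-256 stands in for the unspecified fingerprint extraction function.
        return hashlib.sha256(block).hexdigest()

    def check_and_record(self, block: bytes, location=None):
        """Return (can_deduplicate, metadata) for the given block.

        A block whose fingerprint is already recorded can be deduplicated;
        otherwise its fingerprint and metadata are added to the table.
        """
        fp = self.fingerprint(block)
        if fp in self._table:
            return True, self._table[fp]
        # The role stays unfilled until it is determined later.
        self._table[fp] = {"role": None, "location": location}
        return False, self._table[fp]
```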

Manner 2: Determine a load of the storage system to determine whether deduplication can be performed on the data block.

During implementation, the storage system may determine the load of the storage system. The load may be reflected by a CPU usage, a storage space usage, and a current time period. The storage system may determine whether a current CPU usage exceeds a first value. If the current CPU usage exceeds the first value, the storage system may determine that the load meets a load exceeding condition, and perform deduplication. The storage system may determine whether a current storage space usage exceeds a second value. If the current storage space usage exceeds the second value, the storage system may determine that the load meets the load exceeding condition, and perform deduplication. The storage system may determine a current time point, to determine a time period in which the current time point is located. If the time period in which the current time point is located is a target time period (for example, from 7:00 to 24:00), the load meets the load exceeding condition. The storage system may perform any one or more of the foregoing operations to determine that the load meets the load exceeding condition. If none of the foregoing conditions is met, the load does not meet the load exceeding condition. The storage system may concurrently determine whether the current CPU usage exceeds the first value, whether the current storage space usage exceeds the second value, and the current time period. As long as one determining result is that the load meets the load exceeding condition, the storage system may stop remaining determining operations.

If the storage system determines that the load of the storage system meets the load exceeding condition, the storage system may perform deduplication on the data block. If the storage system determines that the load of the storage system does not meet the load exceeding condition, the storage system may determine that deduplication does not need to be performed.
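The load exceeding condition of Manner 2 can be sketched as a single predicate. The function name `load_exceeds` and the default thresholds (0.8 for the first value, 0.9 for the second value) are hypothetical; only the 7:00 to 24:00 target time period comes from the example in the text.

```python
from datetime import time


def load_exceeds(cpu_usage: float, space_usage: float, now: time,
                 first_value: float = 0.8, second_value: float = 0.9,
                 start_hour: int = 7, end_hour: int = 24) -> bool:
    """Return True if any one of the three load checks is met.

    The `or` chain stops at the first check that succeeds, which mirrors
    stopping the remaining determining operations.
    """
    return (
        cpu_usage > first_value
        or space_usage > second_value
        or start_hour <= now.hour < end_hour
    )
```

In this sketch, a result of True corresponds to performing deduplication on the data block, and False corresponds to determining that deduplication does not need to be performed.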

It should be noted that, because a CPU needs to be occupied each time a data block is compressed, the CPU needs to be considered. Because storage space is also occupied when duplicate data is stored, the storage space also needs to be considered. In some time periods, the upper-layer application stores a large amount of data, and in another time period, the upper-layer application stores a small amount of data. Therefore, deduplication needs to be performed during peak hours and does not need to be performed during off-peak hours.

Step 302: When deduplication cannot be performed on the target data block, obtain a similar fingerprint of the target data block.

The similar fingerprint may include one or more fingerprints.

During implementation, when deduplication cannot be performed on the target data block, the storage system may obtain the similar fingerprint of the target data block. The similar fingerprint of the target data block may be added to a similar fingerprint table. The storage system stores the similar fingerprint table. The similar fingerprint table includes a correspondence between each similar fingerprint and metadata information of a data block. Similar fingerprints included in the similar fingerprint table are similar fingerprints of data blocks whose similar fingerprints have been determined (including a similar fingerprint of an uncompressed data block and a similar fingerprint of a compressed data block). For any data block, the metadata information in the similar fingerprint table includes information such as an identifier indicating whether the data block is a reference block or a similar block (if this has not yet been determined, the field may be left unfilled and is filled in after it is determined), and a storage location (if the data block is stored, the data block has a storage location). In addition, the metadata information may further record a strongly similar fingerprint, for example, a target fingerprint identifier mentioned below. A strongly similar fingerprint of a data block is determined based on all similar fingerprints of the data block, and may be obtained by performing processing, for example, weighting. For example, there are three similar fingerprints: a fingerprint 1, a fingerprint 2, and a fingerprint 3. Each fingerprint corresponds to a weight value, and the weight values are a, b, and c respectively. A sum of a, b, and c is equal to 1, and the strongly similar fingerprint is equal to a*fingerprint 1+b*fingerprint 2+c*fingerprint 3. This may reflect all fingerprints in the similar fingerprint.

The similar fingerprint table may be stored in a form of a table. As shown in Table 1, similar fingerprints include a fingerprint 1, a fingerprint 2, . . . , and a fingerprint n. The fingerprint 1 corresponds to metadata information of a data block 1, metadata information of a data block 2, metadata information of a data block 3, and the like. The fingerprint 2 corresponds to the metadata information of the data block 2, the metadata information of the data block 3, metadata information of a data block 5, and the like. The fingerprint n corresponds to the metadata information of the data block 1, metadata information of a data block 4, and the like.

TABLE 1

Fingerprint      Metadata information
Fingerprint 1    The metadata information of the data block 1, the metadata information of the data block 2, and the metadata information of the data block 3
Fingerprint 2    The metadata information of the data block 2, the metadata information of the data block 3, and the metadata information of the data block 5
. . .            . . .
Fingerprint n    The metadata information of the data block 1, and the metadata information of the data block 4
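The correspondence in Table 1 can be sketched as an inverted index from each fingerprint to the data blocks that have it. The class name `SimilarFingerprintTable`, the use of a block identifier in place of full metadata information, and the method `shared_counts` are assumptions for illustration.

```python
from collections import Counter, defaultdict


class SimilarFingerprintTable:
    """Maps each similar fingerprint to the blocks that contain it."""

    def __init__(self):
        self._index = defaultdict(list)

    def add(self, block_id, fingerprints):
        """Record every distinct fingerprint of a block in the table."""
        for fp in set(fingerprints):
            self._index[fp].append(block_id)

    def shared_counts(self, fingerprints):
        """Count, per recorded block, how many fingerprints it shares."""
        counts = Counter()
        for fp in set(fingerprints):
            counts.update(self._index.get(fp, []))
        return counts
```

Looking up each fingerprint of a target data block and counting the matches yields, for every recorded data block, the quantity of same similar fingerprints that later steps use for grouping.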

It should be noted that, in this embodiment of this application, the similar fingerprint of the target data block may be determined when it is determined that deduplication cannot be performed on the target data block, or the similar fingerprint of the target data block may be determined when whether deduplication can be performed is determined. When a similar fingerprint is determined, a hash algorithm may be used to determine the similar fingerprint of the target data block. The processing may be: dividing the target data block into a plurality of small data units (each data unit has a same length), and calculating a hash value, namely the similar fingerprint of the target data block, for each data unit by using a preset hash function.
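The per-unit hashing described above can be sketched as follows. The function name `similar_fingerprint`, the unit size of 512 bytes, and the use of an 8-byte BLAKE2b digest as the preset hash function are assumptions; the text does not name a unit length or a hash function.

```python
import hashlib
from typing import List


def similar_fingerprint(block: bytes, unit_size: int = 512) -> List[int]:
    """Hash each equal-sized data unit; the list is the similar fingerprint."""
    values = []
    for offset in range(0, len(block), unit_size):
        unit = block[offset:offset + unit_size]
        digest = hashlib.blake2b(unit, digest_size=8).digest()
        values.append(int.from_bytes(digest, "big"))
    return values
```

Two data blocks that differ in only one data unit share every other hash value in this sketch, which is what allows the quantity of same similar fingerprints to serve as a measure of similarity.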

It should be noted that, when the storage system first goes online, the similar fingerprint table is empty. As time goes by, more data blocks are stored, and the similar fingerprint table grows increasingly large.

It should be further noted that the foregoing hash functions for determining the deduplication fingerprint and the similar fingerprint are different functions.

In addition, when deduplication cannot be performed on the target data block, the similar fingerprint of the target data block may be determined directly. Alternatively, when deduplication cannot be performed on the target data block, the similar fingerprint of the target data block may not be determined directly. Instead, whether the load of the storage system meets the load exceeding condition is determined first (for the determining processing, refer to the foregoing implementation 2). If the load exceeding condition is met, the similar fingerprint of the target data block may be generated. If the load exceeding condition is not met, subsequent similar compression processing may not be performed. In other words, the similar fingerprint of the target data block is not determined, and subsequent steps 303 and 304 may not be performed.

Step 303: Determine, based on the similar fingerprint, a combined data block group to which the target data block belongs.

During implementation, the storage system may determine, in the similar fingerprint table by using the similar fingerprints of the target data block, a data block group corresponding to each fingerprint in the similar fingerprint of the target data block, and then determine, in the data block groups, the combined data block group to which the target data block belongs.

In an optional implementation, the combined data block group to which the target data block belongs may be determined in a plurality of manners. The following provides two feasible manners.

Manner 1: Determine, based on a similar fingerprint quantity, a data block group corresponding to the target data block, where the similar fingerprint quantity is a quantity of same similar fingerprints in any two data blocks in one data block group; and form, in the data block group corresponding to the target data block, a first quantity of data blocks that have a same target fingerprint as the target data block into the combined data block group to which the target data block belongs.

The similar fingerprint quantity may be set in advance, and is stored in the storage system. For example, the similar fingerprint quantity may be 2. The similar fingerprint quantity is related to a quantity of similar fingerprints extracted from each data block. Generally, a larger quantity of fingerprints included in a similar fingerprint indicates a larger similar fingerprint quantity, and a smaller quantity of fingerprints included in a similar fingerprint indicates a smaller similar fingerprint quantity. A data amount of differential data between data blocks having the target fingerprint (which may also be referred to as a strongly similar fingerprint) is the smallest, so that a data amount of compressed data of the data blocks is the smallest. The first quantity may be preset, for example, 8.

During implementation, the foregoing similar fingerprint table is established in the storage system, and an uncompressed data block corresponding to each fingerprint in the similar fingerprint of the target data block may be determined from the similar fingerprint table. Then, data blocks having a similar fingerprint quantity of same fingerprints are grouped into one group by using the similar fingerprint quantity, to determine a data block group where the target data block is located, namely the data block group corresponding to the target data block.

When the target data block has not been selected as a member of another reference block, in the data block group corresponding to the target data block, data blocks that have a same target fingerprint as the target data block and that do not form a combined data block group with another data block may be successively selected from each data block group, to form the combined data block group to which the target data block belongs. For any data block, the combined data block group to which the data block belongs may be determined in this manner.

For the target data block, when members are selected for another reference block, if a target fingerprint exists in both that reference block and the target data block, the combined data block group to which that reference block belongs may be determined as the combined data block group to which the target data block belongs.

It should be noted that, if a quantity of data blocks in a data block group is limited, after the quantity of data blocks in the combined data block group reaches the first quantity, no data block is further added to the combined data block group. For example, the target data block corresponds to three data block groups. When a quantity of data blocks that are in the first two data block groups and that have the target fingerprint of the target data block has reached the first quantity, the data blocks form a combined data block group. In this case, the combined data block group to which the target data block belongs is determined.

For example, similar fingerprints of a target data block C3 include a fingerprint 1, a fingerprint 2, and a fingerprint 3, a target fingerprint of C3 is a fingerprint 4, and the similar fingerprint quantity is 1. The fingerprint 1 in the similar fingerprints of the target data block corresponds to data blocks C0, C1, C2, C3, C4, C5, and C6. The fingerprint 2 corresponds to data blocks C0, D1, C3, D3, C5, and C7. The fingerprint 3 corresponds to data blocks C0, C3, C5, C7, D5, and D6. Because the similar fingerprints of the target data block include the fingerprint 1, the fingerprint 2, and the fingerprint 3, the data block groups formed by the data blocks corresponding to the fingerprint 1, the fingerprint 2, and the fingerprint 3 are each a data block group corresponding to the target data block. For the fingerprint 1, C0 is selected as a reference block. If both C3 and C0 have a same strongly similar fingerprint (namely, the target fingerprint), C3 may be left in a data block group in which C0 is used as a reference block, and the data block group in which C0 is used as a reference block is a combined data block group to which the target data block C3 belongs. For the fingerprint 1, the target fingerprint also exists in C5 and C6. In this case, C5 and C6 may be added to the data block group in which C0 is used as a reference block. For the fingerprint 2, the target fingerprint also exists in C7, and C7 may be added to the data block group in which C0 is used as a reference block. Because the target fingerprint exists in all selected data blocks, the target fingerprint exists in all data blocks in the combined data block group.

Manner 2: Determine, based on a similar fingerprint quantity, a data block group corresponding to the target data block, where the similar fingerprint quantity is a quantity of same similar fingerprints in any two data blocks in one data block group; determine a quantity of same similar fingerprints in the target data block and in a data block in each data block group; and form, in the data block group corresponding to the target data block, a first quantity of data blocks that have a maximum quantity of same similar fingerprints as the target data block into the combined data block group to which the target data block belongs.

During implementation, the foregoing similar fingerprint table is established in the storage system, and an uncompressed data block corresponding to each fingerprint in the similar fingerprint of the target data block may be determined from the similar fingerprint table. Then, data blocks having a similar fingerprint quantity of same fingerprints are grouped into one group by using the similar fingerprint quantity, to determine a data block group where the target data block is located, namely the data block group corresponding to the target data block.

When the target data block has not been selected as a member of another reference block, data blocks that are in the data block group corresponding to the target data block and that do not form a combined data block group with another data block are determined, a quantity of same similar fingerprints in each of the data blocks and in the target data block is determined, and then the data blocks are arranged in descending order of this quantity. The first quantity of the data blocks are selected to form the combined data block group to which the target data block belongs.

For the target data block, when members are selected for another reference block, the first quantity of members need to be selected for that reference block. In a ranking (in descending order) of quantities of same similar fingerprints in the reference block and in uncompressed data blocks that are in a data block group corresponding to the reference block, if the target data block is among the first quantity of data blocks, the combined data block group to which the reference block belongs may be determined as the combined data block group to which the target data block belongs.

For example, similar fingerprints of a target data block E3 include a fingerprint 1, a fingerprint 2, and a fingerprint 3, and the similar fingerprint quantity is 1. The fingerprint 1 in the similar fingerprints of the target data block corresponds to data blocks E0, E1, E2, E3, E4, E5, and E6. The fingerprint 2 corresponds to data blocks E0, F1, E3, F3, E5, and E7. The fingerprint 3 corresponds to data blocks E0, E3, E5, E7, F5, and F6. Because the similar fingerprints of the target data block include the fingerprint 1, the fingerprint 2, and the fingerprint 3, the data block groups formed by the data blocks corresponding to the fingerprint 1, the fingerprint 2, and the fingerprint 3 are each a data block group corresponding to the target data block. Currently, uncompressed data blocks include E2, E4, E5, E6, F5, and F6. Quantities of same similar fingerprints are arranged in a descending order as E4, E5, F5, F6, E2, and E6. The first quantity is 6. E4, E5, F5, F6, E2, and E3 may be selected to form a combined data block group.
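The ranking used in manner 2 may be sketched, for illustration only, as follows. The function name combined_group_by_overlap, the tie-breaking by block identifier, and the sample fingerprint sets are assumptions made for this example. The sketch also assumes that the first quantity counts the target data block itself, as in the example above.

```python
from typing import Dict, Iterable, List, Set


def combined_group_by_overlap(target_id: str,
                              fingerprints_of: Dict[str, Set[int]],
                              candidates: Iterable[str],
                              first_quantity: int) -> List[str]:
    """Pick the candidates that share the most similar fingerprints with the target."""
    target_fps = fingerprints_of[target_id]
    scored = []
    for cid in candidates:
        if cid == target_id:
            continue
        shared = len(target_fps & fingerprints_of[cid])
        if shared > 0:  # blocks with no common fingerprint are not similar
            scored.append((shared, cid))
    # Descending by shared count; ties are broken by identifier (an assumption).
    scored.sort(key=lambda item: (-item[0], item[1]))
    members = [cid for _, cid in scored[:first_quantity - 1]]
    return [target_id] + members


# Hypothetical fingerprint sets; "candidates" are blocks not yet in any combined group.
fps = {
    "E3": {1, 2, 3},
    "E4": {1, 2, 3},
    "E5": {1, 2},
    "F5": {3},
    "E2": {1},
    "E6": {9},
}
group = combined_group_by_overlap("E3", fps, ["E2", "E4", "E5", "E6", "F5"], 4)
print(group)
```

Once the group reaches the first quantity, no further data block is added, which matches the limit described for the combined data block group.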

It should be noted that, if a quantity of data blocks in a data block group is limited, after the quantity of data blocks in the combined data block group reaches the first quantity, no data block is further added to the combined data block group.

In addition, after a second quantity of batch processing cycles or batch processing processes (for example, the second quantity may be 2), if no similar data block or repeated data block is found for some data blocks, the data blocks may be directly compressed and stored in a conventional manner.

In addition, in this embodiment of this application, the data block group corresponding to the target data block may alternatively be determined not based on the similar fingerprint quantity. A first data block of processed data blocks in the batch is used as the reference block. The first quantity of data blocks having same strongly similar fingerprints are selected from the remaining data blocks, to form a data block group to which the first data block belongs. Alternatively, the first quantity of data blocks having a maximum quantity of same similar fingerprints as the first data block are selected from the remaining data blocks, to form a data block group to which the first data block belongs. Next, a first data block is selected from data blocks that do not form a data block group as the reference block, and then processing of selecting data blocks from the remaining data blocks continues to be performed, to obtain a data block group to which the first data block belongs. In this manner, the combined data block group to which the target data block belongs may be obtained. In addition, if the first quantity of data blocks having the same strongly similar fingerprints cannot be selected for the first data block, a data block whose similar fingerprint quantity exceeds a value may be selected after the current selection, and added to the data block group using the first data block as the reference block.

For the foregoing combined data block group, manners of determining the reference block are further provided in this embodiment of this application.

Manner 1: In the combined data block group, a first added data block is determined as the reference block.

During implementation, in the combined data block group, an adding order of each data block may be determined, and an earliest added data block is determined as the reference block of the combined data block group. For example, in the foregoing example, C0 is first added, and C0 is determined as the reference block.

Manner 2: In the combined data block group, a data block that yields a highest data reduction rate for the combined data block group is determined as the reference block.

During implementation, when the data blocks in the combined data block group are compressed, a data block is tentatively used as the reference block, and each data block in the combined data block group is compressed to obtain compressed data of each data block in the combined data block group. Then, a data amount of the combined data block group before compression is compared with a data amount of the compressed data of the combined data block group, to obtain a reduction rate corresponding to the reference block. This manner may be used for each candidate reference block to determine the reduction rate corresponding to that reference block. A data block with a largest reduction rate is selected as the reference block.

For example, the combined data block group includes three data blocks: A1, A2, and A3. When A1 is used as the reference block, an overall reduction rate of the combined data block group is 77%. When A2 is used as the reference block, the overall reduction rate of the combined data block group is 65%. When A3 is used as the reference block, the overall reduction rate of the combined data block group is 50%. In this way, it may be obtained that the overall reduction rate of the combined data block group is the highest when A1 is used as the reference block. Therefore, in the combined data block group, A1 may be used as the reference block.
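The selection in manner 2 may be sketched, for illustration only, as follows. The sketch assumes equal-length data blocks, a byte-wise XOR as the differential data, zlib as the lossless compressor, and a reduction rate defined as 1 minus the ratio of compressed size to original size. None of these choices is specified by the embodiments, and the function names are hypothetical.

```python
import random
import zlib
from typing import Dict, Tuple


def xor_delta(block: bytes, reference: bytes) -> bytes:
    """Byte-wise difference between two equal-length blocks (assumed delta format)."""
    if len(block) != len(reference):
        raise ValueError("this sketch assumes equal-length blocks")
    return bytes(a ^ b for a, b in zip(block, reference))


def reduction_rate(blocks: Dict[str, bytes], ref_id: str) -> float:
    """Overall reduction rate of the group when ref_id is the reference block."""
    reference = blocks[ref_id]
    compressed = len(zlib.compress(reference))
    for block_id, data in blocks.items():
        if block_id != ref_id:
            compressed += len(zlib.compress(xor_delta(data, reference)))
    original = sum(len(data) for data in blocks.values())
    return 1.0 - compressed / original


def pick_reference(blocks: Dict[str, bytes]) -> Tuple[str, float]:
    """Try every block as the reference and keep the one with the best rate."""
    best = max(blocks, key=lambda cand: reduction_rate(blocks, cand))
    return best, reduction_rate(blocks, best)


def tweak(data: bytes, positions) -> bytes:
    """Flip a few bytes to produce a similar block for the demonstration."""
    out = bytearray(data)
    for pos in positions:
        out[pos] ^= 0xFF
    return bytes(out)


rng = random.Random(0)
base = bytes(rng.getrandbits(8) for _ in range(4096))
blocks = {"A1": base, "A2": tweak(base, [10, 20]), "A3": tweak(base, [3000])}
best_id, best_rate = pick_reference(blocks)
print(best_id, round(best_rate, 3))
```

Because every candidate requires compressing the whole group once, the cost grows with the square of the group size, which is consistent with the efficiency trade-off between the two manners.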

It should be noted that the foregoing manner 1 of determining the reference block is relatively efficient, but a data block with a highest reduction rate may not be selected. In the foregoing manner 2 of determining the reference block, although a reference block with a high reduction rate can be determined, the selection process is complex and efficiency is relatively low. Therefore, when there are a relatively large quantity of data blocks in the combined data block group, the manner 1 of determining the reference block may be selected, to improve selection efficiency. However, when there are a relatively small quantity of data blocks in the combined data block group, the manner 2 of determining the reference block may be selected, to provide a high reduction rate.

Step 304: Perform similar compression on the target data block based on the reference block in the combined data block group.

During implementation, the storage system may determine differential data between the target data block and the reference block in the combined data block group. If the reference block has been compressed, the differential data may be directly compressed to obtain compressed data of the target data block. If the reference block has not been compressed, the reference block may be compressed, and the differential data is compressed. Subsequently, data of the target data block may be restored by using data of the reference block and the differential data.
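A minimal sketch of similar compression and of the subsequent restoration is given below for illustration. It assumes equal-length blocks, a byte-wise XOR as the differential data, and zlib as the compressor. The embodiments do not prescribe a particular differential encoding, and the function names are hypothetical.

```python
import zlib


def make_delta(target: bytes, reference: bytes) -> bytes:
    """Differential data between the target block and the reference block."""
    if len(target) != len(reference):
        raise ValueError("this sketch assumes equal-length blocks")
    return bytes(t ^ r for t, r in zip(target, reference))


def similar_compress(target: bytes, reference: bytes) -> bytes:
    """Compress only the differential data of the target block."""
    return zlib.compress(make_delta(target, reference))


def restore(compressed_delta: bytes, reference: bytes) -> bytes:
    """Rebuild the target block from the reference block and the differential data."""
    delta = zlib.decompress(compressed_delta)
    return bytes(d ^ r for d, r in zip(delta, reference))


reference = bytes(range(256)) * 16   # a 4 KB reference block
target = bytearray(reference)
target[100] = 0xFF                   # the target differs in a single byte
target = bytes(target)

packed = similar_compress(target, reference)
assert restore(packed, reference) == target
print(len(target), len(packed))      # original size versus stored delta size
```

When the two blocks are nearly identical, the differential data is mostly zero bytes and therefore compresses far better than the target block would on its own.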

In an optional implementation, in the storage system, data in a same combined data block group may be stored in a same storage block, or may be stored in different storage blocks. This is not limited in this embodiment of this application. When the data is stored in different storage blocks, if the reference block and differential data of a currently to-be-read data block are in a same storage block, the reference block and the differential data may be directly read from the storage block at a time (if the reference block is in the front, the reading may be performed from the reference block to the differential data of the to-be-read data block; and if the reference block is in the back, the reading may be performed from the differential data of the to-be-read data block to the reference block). If the reference block and the differential data of the currently to-be-read data block are not in a same storage block, the reference block and the differential data may be separately read from different storage blocks.

In an optional implementation, in the storage system, compressed data is stored in a storage block. During storage, compressed data of a same combined data block group needs to be stored in one storage block and is consecutively stored. The processing may be as follows:

consecutively storing, in a same storage block, compressed data obtained after similar compression is performed on data blocks, and compressed data of another data block in the combined data block group.

During implementation, when compressed data obtained after similar compression is performed on the target data block is stored, a storage block in which the compressed data of the other data block in the combined data block group is stored and a storage location of the compressed data in the storage block may be determined. Then, the compressed data obtained after similar compression is performed on the target data block and the compressed data of the other data block in the combined data block group are consecutively stored together.

For example, as shown in FIG. 4, the data blocks are A1 and A2, A0 is a reference block, dA1 is differential data between A1 and A0, and dA2 is differential data between A2 and A0. A0, dA1, and dA2 may be stored in a same storage block, and are consecutively stored, where A0 is adjacent to dA1, and dA1 is adjacent to dA2.

It should be noted that, because a processing resource is consumed each time data is read, the reference block and the differential data are generally read at a time instead of being read twice, to save the processing resource. Therefore, the compressed data in the foregoing combined data block group is consecutively stored in one storage block, so that the reference block, and the differential data between the other data block and the reference block may be read at a time during reading.

In an optional implementation, to reduce an amount of data read at a time, a location in which the data of the reference block is stored may be configured, and the processing may be as follows:

if there are a plurality of data blocks other than the reference block in the combined data block group, compressed data of m data blocks is before the data of the reference block, and compressed data of n data blocks is after the data of the reference block, where a difference between m and n is equal to any one of 0, 1, or −1, and both m and n are greater than or equal to 1.

During implementation, if there are a plurality of data blocks other than the reference block in the combined data block group, assuming that there are m+n data blocks other than the reference block, the compressed data of m data blocks may be set before the data of the reference block, and the compressed data of n data blocks may be set after the data of the reference block. If m+n is an odd number, a relationship between m and n may be that m−n is equal to 1 or −1. If m+n is an even number, a relationship between m and n may be that m−n is equal to 0. In this way, during data reading, if differential data of a data block after the reference block needs to be read, the reading may directly start from the reference block until the differential data of the data block is read, without a need to read differential data of all data blocks. If differential data of a data block before the reference block needs to be read, the reading may directly start from the differential data of the data block, and ends after the data of the reference block is read, without a need to read all the data. Therefore, less data is read, and reading efficiency is improved.

For example, as shown in FIG. 5 that corresponds to FIG. 4, in addition to the reference block, there are two data blocks A1 and A2 in the combined data block group, and A0 is stored between dA1 and dA2. In this way, when data of A2 is read, the reading may directly start from A0, and ends after dA2 is read, without a need to read dA1. This can speed up the reading. When data of A1 is read, the reading may start from dA1, and ends after A0 is read, without a need to read dA2. This can speed up the reading.
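For illustration only, placing the reference block in the middle of the group and computing the single contiguous range to be read may be sketched as follows. The function names layout_group and read_span and the byte sizes are assumptions made for this example. The sketch places m = floor((m+n)/2) items before the reference block, so that m−n is 0 or −1.

```python
from typing import Dict, List, Tuple


def layout_group(reference_id: str, other_ids: List[str]) -> List[str]:
    """Order the stored items so that the reference block sits in the middle."""
    m = len(other_ids) // 2  # items placed before the reference block
    return other_ids[:m] + [reference_id] + other_ids[m:]


def read_span(layout: List[str], sizes: Dict[str, int],
              reference_id: str, wanted_id: str) -> Tuple[int, int]:
    """Byte range (start, end) of one contiguous read covering both items."""
    offsets = {}
    position = 0
    for item in layout:
        offsets[item] = (position, position + sizes[item])
        position += sizes[item]
    ref_start, ref_end = offsets[reference_id]
    want_start, want_end = offsets[wanted_id]
    return min(ref_start, want_start), max(ref_end, want_end)


# Hypothetical sizes in bytes for the items of FIG. 5.
sizes = {"A0": 4096, "dA1": 100, "dA2": 200}
layout = layout_group("A0", ["dA1", "dA2"])
print(layout)
print(read_span(layout, sizes, "A0", "dA2"))  # starts at A0, so dA1 is skipped
print(read_span(layout, sizes, "A0", "dA1"))  # ends after A0, so dA2 is skipped
```

With the reference block in the middle, a read for any one data block touches at most about half of the differential data in the group, rather than up to all of it.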

In addition, when there is one data block other than the reference block in the combined data block group, the data of the reference block may be located before compressed data of the other data block in the combined data block group, or may be located after the compressed data of the other data block in the combined data block group.

In addition, the length of the storage block is generally fixed. When the storage block is not full after data of one combined data block group is stored in the storage block, the storage block may store data of another combined data block group, but data of a same combined data block group needs to be stored in a same storage block, to facilitate subsequent reading.

To describe a structure of the storage block more clearly, an embodiment of this application further provides a structure of a storage block. As shown in FIG. 6, in original structures of storage blocks, a storage block 0 is used to store data blocks X, Y, Z, and G, a storage block 1 is used to store data blocks A0, B0, A1, and D1, and a storage block 2 is used to store data blocks A2, D0, A0, and B1. The data blocks X, Y, Z, and G are data blocks on which deduplication or similar compression is not performed, and no change may be made. Because both the storage block 1 and the storage block 2 have A0, deduplication may be performed, to delete one A0. Because both A1 and A2 are similar to A0, similar compression may be performed, to obtain differential data dA1 between A1 and A0 and differential data dA2 between A2 and A0. dA1, dA2, and A0 may be placed in one storage block and stored in the storage block 1. Because B0 is similar to B1 and B0 is a reference block, similar compression may be performed, to obtain differential data dB1 between B1 and B0. Because D1 is similar to D0 and D0 is a reference block, similar compression may be performed, to obtain differential data dD1 between D1 and D0. B0 and dB1 may be stored in a same storage block, D0 and dD1 may be stored in a same storage block, and B0, dB1, D0 and dD1 are stored in the storage block 2. In other words, the storage blocks are classified into two types. One type of storage block is used to store a data block on which deduplication and/or similar compression are not performed, and the other type of storage block is used to store a data block on which deduplication and/or similar compression are performed.

In addition, in the foregoing step 304, compressed data of a same combined data block group may be stored in a same compressed block, and the processing may be as follows:

if a compressed block to which the combined data block group belongs has a remaining capacity, compress differential data between a data block and a reference block, and store the compressed data in the compressed block; or if a compressed block to which the combined data block group belongs has no remaining capacity, create a new compressed block, re-determine a data block group to which a data block belongs, select a reference block from the re-determined data block group, compress differential data of the data block and the re-selected reference block, and store the compressed data into the newly created compressed block.

Each compressed block is used to store compressed data of one combined data block group, and a data amount of data that can be stored in the compressed block is a fixed value which may be 16 KB, 32 KB, or another value.

During implementation, when similar compression is performed on a target data block in a combined data block group, if a compressed block to which the combined data block group belongs has a remaining capacity, and the remaining capacity is greater than or equal to a data amount of differential data between the target data block and the reference block in the combined data block group, the differential data between the target data block and the reference block in the combined data block group may be compressed, and then the compressed differential data is stored in the compressed block.

If the compressed block to which the combined data block group belongs has no remaining capacity to store the differential data of the target data block relative to the reference block, a new compressed block may be created. If there is another data block that is in the combined data block group and that is not compressed, the target data block and the other data block in the combined data block group may form a new combined data block group, and then a reference block is determined in the new combined data block group. A data block that is first added may be determined as the reference block, or a data block that maximizes a reduction rate of the new combined data block group may be determined as the reference block (in this case, the target data block may also be selected as the reference block). If the target data block is not the reference block, differential data between the target data block and the reselected reference block is compressed, the compressed data is stored in the new compressed block, and the reference block in the new combined data block group may be stored. If the target data block is the reference block, conventional compression and storage may be directly performed, and similar compression may be performed on another data block with reference to the target data block. In this way, the target data block can be compressed.
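The capacity check on a fixed-size compressed block may be sketched, for illustration only, as follows. The class CompressedBlock and the function store_delta are hypothetical names. As a simplification, the sketch compares the remaining capacity with the size of the payload that is actually stored, and it only marks the point at which regrouping and reselection of the reference block would take place rather than performing them.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CompressedBlock:
    """Fixed-capacity container for the compressed data of one combined group."""
    capacity: int = 16 * 1024
    entries: List[bytes] = field(default_factory=list)

    @property
    def remaining(self) -> int:
        return self.capacity - sum(len(entry) for entry in self.entries)

    def try_append(self, payload: bytes) -> bool:
        """Store the payload only if it fits in the remaining capacity."""
        if len(payload) > self.remaining:
            return False
        self.entries.append(payload)
        return True


def store_delta(current: CompressedBlock,
                payload: bytes) -> Tuple[CompressedBlock, bool]:
    """Return the block that received the payload and whether it is newly created."""
    if current.try_append(payload):
        return current, False
    # No room left: a new compressed block is created. In the described method,
    # the group would be re-determined and a reference block reselected here.
    fresh = CompressedBlock(capacity=current.capacity)
    if not fresh.try_append(payload):
        raise ValueError("payload exceeds the capacity of an empty compressed block")
    return fresh, True


block = CompressedBlock(capacity=10)
block, created = store_delta(block, b"12345678")  # fits, 2 bytes remain
next_block, created_next = store_delta(block, b"abc")  # does not fit
print(created, created_next, block.remaining, next_block.entries)
```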

For example, as shown in FIG. 7, a combined data block group to which data blocks belong includes five data blocks: a reference block, a, b, c, and d. The first data block is the reference block. Lossless compression is performed on data of the reference block, similar compression is performed on the other data blocks relative to the reference block, and the compressed data is stored in one compressed block. The compressed data of the other data blocks is sequentially da, db, dc, and dd.

It should be noted that the compressed block is generally sized to store a small amount of data to facilitate reading. If the compressed block can store a large amount of data, then to read data at the end of the compressed block, reading needs to be performed from the reference block to the end. Therefore, a large amount of data is read at a time, and more resources are wasted.

According to the embodiments of this application, a new storage system may directly combine deduplication and similar compression, to obtain a system with a new compression technology. For a system that is already online, if the system has no deduplication or similar compression, an independent processing mechanism may be embedded into the system.

In this embodiment of this application, when a data block is stored, whether deduplication can be performed on a target data block is determined; when deduplication cannot be performed on the target data block, a similar fingerprint of the target data block is obtained; a combined data block group to which the target data block belongs is determined based on the similar fingerprint; and similar compression is performed on the target data block based on a reference block in the combined data block group. In this way, similar compression and deduplication are combined. When deduplication cannot be performed, similar compression can be used to further compress some data, to improve a reduction rate.

Based on the foregoing processing of compressing data, an embodiment of this application further correspondingly provides a process of reading compressed data. A case in which compressed data in a combined data block group is stored in a same storage block is used as an example. Reading steps are shown in FIG. 8.

Step 801: Receive a read request for a to-be-read data block.

During implementation, after a data block is stored in a storage system, if the data block needs to be read subsequently, a read request may be sent to the storage system, and an identifier of the to-be-read data block is carried in the read request.

Step 802: Obtain metadata information of the to-be-read data block.

During implementation, the storage system may read the metadata information of the to-be-read data block from a storage block (the metadata information is usually stored in a first storage block) by using the identifier of the to-be-read data block. The metadata information may include a storage block in which a reference block of the to-be-read data block is located, a location of the reference block in the storage block, and an offset location of the to-be-read data block relative to the reference block (the offset location may be an offset data amount, a quantity of offset data blocks, or the like). For example, the to-be-read data block is offset by eight data blocks to the right of the location of the reference block.

Step 803: Read, based on the metadata information, the reference block of the to-be-read data block and differential data between the to-be-read data block and the reference block from the storage block to which the reference block of the to-be-read data block belongs.

During implementation, after obtaining the metadata information, the storage system may determine, by using the metadata information, the location of the reference block in the storage block and the offset location of the to-be-read data block relative to the reference block. If the reference block is before the differential data of the to-be-read data block, reading may start from the reference block until the differential data between the to-be-read data block and the reference block is read. If the reference block is after the differential data of the to-be-read data block, reading may start from the differential data of the to-be-read data block until the reference block is read. Data of the reference block and the differential data between the to-be-read data block and the reference block are obtained from the read data.

It should be noted herein that the storage block in which the reference block is located further includes a header (head) of the reference block, and the header is used to describe a quantity of data blocks included in the storage block, a data amount of the storage block, and the like.

Step 804: Restore data of the to-be-read data block based on the reference block and the differential data.

During implementation, the storage system may superpose the data of the reference block with the differential data, to obtain all data of the to-be-read data block.

Step 805: Send the data of the to-be-read data block to a requester.

In this way, in one process of reading data from a disk, if the metadata information is stored in a memory, reading in the storage block may start from the reference block and continue until the differential data of the to-be-read data block is read. When differential data of another data block exists between the reference block and the differential data of the to-be-read data block, the differential data of the other data block is also read. Although the differential data of the other data block is read, compared with first reading the data of the reference block and then separately reading the differential data of the to-be-read data block, this approach occupies fewer processing resources. It can be learned that in this application, all the data of the to-be-read data block can be obtained with only one read.

If the metadata information is stored in the disk, the metadata information of the to-be-read data block is first read from the disk, and then the reference block and the differential data between the reference block and the to-be-read data block are read from the storage block at a time. Therefore, in this case, all the data of the to-be-read data block can be read with only two read operations.

FIG. 9 is a structural diagram of an apparatus for compressing data of a storage system according to an embodiment of this application. The apparatus may be implemented as a part of the apparatus or the entire apparatus by using software, hardware, or a combination thereof. The apparatus provided in this embodiment of this application may implement the procedure in the embodiment of this application shown in FIG. 3. The apparatus includes a determining module 910, an obtaining module 920, and a compression module 930.

The determining module 910 is configured to determine whether deduplication can be performed on a target data block, and may specifically be configured to perform the step 301 and implicit steps included therein.

The obtaining module 920 is configured to, when deduplication cannot be performed on the target data block, obtain a similar fingerprint of the target data block, and may specifically be configured to perform the step 302 and implicit steps included therein.

The determining module 910 is further configured to determine, based on the similar fingerprint, a combined data block group to which the target data block belongs, and may specifically be configured to perform the step 303 and the implicit steps included therein.

The compression module 930 is configured to perform similar compression on the target data block based on a reference block in the combined data block group, and may specifically be configured to perform the step 304 and implicit steps included therein.

In an optional implementation, the determining module 910 is configured to:

generate a deduplication fingerprint of the target data block; and

query whether the storage system has a fingerprint that is the same as the deduplication fingerprint, to determine whether deduplication can be performed on the target data block.
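The two operations above can be illustrated with a minimal Python sketch. It is not part of the embodiment: SHA-256 stands in for whatever deduplication fingerprint function the storage system uses, and the `FingerprintIndex` class with its in-memory dictionary is a hypothetical stand-in for the fingerprint store of the storage system.

```python
import hashlib
from typing import Dict, Optional


def dedup_fingerprint(block: bytes) -> str:
    # Assumption: SHA-256 stands in for the fingerprint function of the storage system.
    return hashlib.sha256(block).hexdigest()


class FingerprintIndex:
    """Maps a deduplication fingerprint to the location of the stored block."""

    def __init__(self) -> None:
        self._locations: Dict[str, int] = {}

    def lookup(self, block: bytes) -> Optional[int]:
        """Return the location of an identical stored block, or None if there is none."""
        return self._locations.get(dedup_fingerprint(block))

    def record(self, block: bytes, location: int) -> None:
        """Remember where a newly stored block lives; the first location wins."""
        self._locations.setdefault(dedup_fingerprint(block), location)
```

A `lookup` result other than `None` corresponds to the case in which deduplication can be performed on the target data block.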

In an optional implementation, the determining module 910 is configured to:

determine a load of the storage system to determine whether deduplication can be performed on the target data block.

In an optional implementation, the compression module 930 is further configured to:

consecutively store, in a same storage block, compressed data obtained after similar compression is performed on the target data block, and compressed data of another data block in the combined data block group.

In an optional implementation, if there are a plurality of data blocks other than the reference block in the combined data block group, compressed data of m data blocks is before data of the reference block, and compressed data of n data blocks is after the reference block, where a difference between m and n is equal to any one of 0, 1, or −1, and both m and n are greater than or equal to 1.
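The balanced placement can be sketched in Python as follows. The function name and the list-of-bytes representation of the storage block are hypothetical. The sketch simply places half of the compressed data before the reference block, which is one way, not necessarily the way used in the embodiment, to keep the difference between m and n within 0, 1, or −1.

```python
from typing import List, Tuple


def layout_storage_block(reference: bytes, deltas: List[bytes]) -> Tuple[List[bytes], int]:
    """Order compressed data around the reference block so that m and n differ by at most 1.

    Returns the ordered items and the index of the reference block, which equals m.
    """
    if len(deltas) < 2:
        raise ValueError("the balanced layout needs at least two non-reference blocks")
    m = len(deltas) // 2  # items placed before the reference; n = len(deltas) - m
    ordered = deltas[:m] + [reference] + deltas[m:]
    return ordered, m
```

Placing the reference block near the middle limits how much differential data of other data blocks can lie between the reference block and the differential data of any one data block, which plausibly keeps a single contiguous read short.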

In an optional implementation, the determining module 910 is further configured to:

determine, based on a similar fingerprint quantity, a data block group corresponding to the target data block, where the similar fingerprint quantity is a quantity of same similar fingerprints in any two data blocks in one data block group; and form, in the data block group corresponding to the target data block, a first quantity of data blocks that have a same target fingerprint as the target data block into the combined data block group to which the target data block belongs, where a data amount of differential data between the target data block and a data block that has the target fingerprint is less than a data amount of differential data between the target data block and a data block that does not have the target fingerprint.

In an optional implementation, the determining module 910 is further configured to:

determine, based on a similar fingerprint quantity, a data block group corresponding to the target data block, where the similar fingerprint quantity is a quantity of same similar fingerprints in any two data blocks in one data block group; determine a quantity of same similar fingerprints in the target data block and in a data block in each data block group; and form, in the data block group corresponding to the target data block, a first quantity of data blocks that have a maximum quantity of same similar fingerprints as the target data block into the combined data block group to which the target data block belongs.
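The selection by maximum quantity of same similar fingerprints can be sketched in Python as follows. The function name, the block identifiers, and the representation of similar fingerprints as integer sets are hypothetical and serve only to illustrate the ranking. How similar fingerprints are generated is outside the scope of this sketch.

```python
from typing import Dict, List, Set


def pick_most_similar(target_fps: Set[int],
                      candidates: Dict[str, Set[int]],
                      first_quantity: int) -> List[str]:
    """Return the ids of the `first_quantity` candidate blocks that share the most
    similar fingerprints with the target block. Ties keep the candidates' original order."""
    ranked = sorted(candidates,
                    key=lambda block_id: len(target_fps & candidates[block_id]),
                    reverse=True)
    return ranked[:first_quantity]
```

For example, with a target data block whose similar fingerprints are {1, 2, 3, 4}, a candidate holding {1, 2, 3, 4} ranks ahead of one holding {1, 2, 3}, which in turn ranks ahead of one holding only {1, 2}.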

In this embodiment of this application, when a data block is stored, whether deduplication can be performed on a target data block is determined; when deduplication cannot be performed on the target data block, a similar fingerprint of the target data block is obtained; a combined data block group to which the target data block belongs is determined based on the similar fingerprint; and similar compression is performed on the target data block based on a reference block in the combined data block group. In this way, similar compression and deduplication are combined. When deduplication cannot be performed, similar compression can be used to further compress some data, to improve a reduction rate.
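The order of the decisions summarized above can be illustrated with a simplified Python sketch. It rests on several assumptions that are not taken from the embodiment: SHA-256 serves as the deduplication fingerprint, each data block has a single similar fingerprint supplied by a caller-provided function, each group is represented only by its reference block, blocks have equal length, and differential data is a list of (offset, byte value) pairs. All names are hypothetical.

```python
import hashlib
from typing import Callable, Dict, Hashable, List, Set, Tuple


def write_block(block: bytes,
                dedup_index: Set[str],
                references: Dict[Hashable, bytes],
                similar_fp: Callable[[bytes], Hashable]) -> Tuple[str, object]:
    """Decide how one block is stored: deduplicated, as a new reference, or as a delta."""
    fp = hashlib.sha256(block).hexdigest()      # assumed deduplication fingerprint
    if fp in dedup_index:                       # identical block already stored
        return ("dedup", fp)
    dedup_index.add(fp)

    sfp = similar_fp(block)                     # deduplication failed: try similarity
    reference = references.get(sfp)
    if reference is None:                       # first block of a new group
        references[sfp] = block
        return ("reference", sfp)

    # Differential data: positions and byte values where the block departs from the reference.
    diff: List[Tuple[int, int]] = [
        (i, new) for i, (old, new) in enumerate(zip(reference, block)) if old != new
    ]
    return ("delta", (sfp, diff))
```

As a usage illustration, with a toy `similar_fp` that returns the first four bytes of a block, writing `b"ABCD1234"` twice and then `b"ABCD1299"` yields a reference, a deduplication hit, and a delta holding two changed positions.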

It should be noted that when the apparatus for compressing data of a storage system, provided in the foregoing embodiment, processes data, division of the foregoing functional modules is used only as an example for description. In actual application, the foregoing functions may be allocated to different functional modules and implemented according to a requirement, in other words, an internal structure of the apparatus is divided into different functional modules for implementing all or some of the functions described above. In addition, the apparatus for compressing data of a storage system, provided in the foregoing embodiment, and the embodiment of the method for compressing data of a storage system belong to a same concept. For details about a specific implementation process of the apparatus, refer to the method embodiment. Details are not described herein again.

In an optional implementation, an embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium stores an instruction, and when the instruction is run on a storage system, the storage system is enabled to perform the foregoing method for compressing data of a storage system.

In an optional implementation, an embodiment of this application further provides a computer program product including an instruction. When the computer program product runs on a storage system, the storage system is enabled to perform the foregoing method for compressing data of a storage system.

All or some of the foregoing embodiments may be implemented by using software, hardware, firmware, or any combination thereof. When the software is used for implementation, all or some of the embodiments may be implemented in a form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a server or a terminal, all or some of the procedures or functions according to the embodiments of the present invention are generated. The computer instructions may be stored in a computer-readable storage medium or may be transmitted from a computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, a coaxial cable, an optical fiber, or a digital subscriber line) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a server or a terminal, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disk (DVD)), or a semiconductor medium (for example, a solid-state drive).

The foregoing descriptions are merely example embodiments of this application, but are not intended to limit this application. Any modification, equivalent replacement, or improvement made without departing from the spirit and principle of this application should fall within the protection scope of this application.

What is claimed is:
 1. A method of storing data block performed by a storage device, comprising: storing data block groups, wherein each data block group of the data block groups has a plurality of data blocks, and each data block of said each data block group has a common part identical to a part of another data blocks of said each data block group; selecting from the data block groups a target data block group, wherein one data block in the target data block group has a common part identical to a part of the target data block; and saving the target block by storing a target reference block of the target data block group and differential data between the target data block and the target reference block.
 2. The method according to claim 1, wherein the step of saving the data block groups comprises: storing, for each data block group of the data block groups, a reference block and differential data between each data block in the data block group and the reference block; wherein each data block group comprises a reference block.
 3. The method according to claim 2, furthering comprising: continuously storing all data of each data block group in storage address.
 4. The method according to claim 1, furthering comprising: deduplicating data blocks in the storage system to obtain data blocks of the data block groups, wherein data blocks obtained after deduplicating are not the same.
 5. The method according to claim 1, furthering comprising: comparing, fingerprints of a part of data blocks of data block groups and fingerprints of parts of the target data block for selecting.
 6. The method according to claim 1, furthering comprising: comparing, fingerprints of a common parts and fingerprints of parts of the target data block, wherein all data blocks in each data block groups have a common part.
 7. The method according to claim 1, wherein the comparing step comprises: obtaining the target data block by the reference block and the differential data.
 8. A storage device, comprising: a memory storing executable instructions; and a processor configured to execute the executable instructions to: save data block groups, wherein each data block group of the data block groups has a plurality of data blocks, each data block of said each data block group has a common part identical to a part of another data blocks of said each data block group; select from the data block groups a target data block group, wherein one data block in the target data block group has a common part identical to a part of the target data block; and save the target block by storing a target reference block of the target data block group and differential data between the target data block and the target reference block.
 9. The storage device according to claim 8, wherein the processor is configured to save the data block groups by storing, for each data block group of the data block groups, a reference block and differential data between each data block in the data block group and the reference block; wherein each data block group comprises a reference block.
 10. The storage device according to claim 9, wherein the processor is configured to further execute the executable instructions to: continuously store all data of each data block group in storage address.
 11. The storage device according to claim 8, wherein the processor is configured to further execute the executable instructions to: deduplicate data blocks to obtain data blocks of the data block groups, wherein data blocks obtained after deduplicating are not the same.
 12. The storage device according to claim 8, wherein the processor is configured to: compare fingerprints of a part of data blocks of data block groups and fingerprints of parts of the target data block.
 13. The storage device according to claim 8, wherein the processor is configured to: compare fingerprints of a common parts and fingerprints of parts of the target data block, wherein all data blocks in each data block groups have a common part.
 14. The storage device according to claim 8, wherein the processor is configured to: obtain the target data block by the reference block and the differential data.
 15. A storage device, comprising: a memory storing executable instructions; and a processor configured to execute the executable instructions to: determine data blocks in a plurality of data blocks, if one data block has a common part identical to a part of another data blocks; select a group of data blocks from the plurality of data blocks, wherein each data block of said each data block group has a common part identical to a part of another data blocks of said each data block group; and save the group of data blocks by storing a reference block and differential data between each data block in the group and the reference block.
 16. The storage device according to claim 15, wherein the processor is configured to: compare fingerprints between different data blocks of the plurality of data blocks, wherein each said data block of the plurality of data has at least one fingerprints.
 17. The storage device according to claim 15, wherein the processor is configured to further execute the executable instructions to: continuously store all data of the data block group in storage address.
 18. The storage device according to claim 15, furthering comprising: deduplicate data blocks in the storage device to obtain the plurality of data blocks.
 19. The storage device according to claim 15, the selecting step comprising: select a group of data blocks from a number of data blocks, according to fingerprint of part of each data block in the number of data blocks.
 20. The storage device according to claim 15, wherein the processor is configured to: select the group of data blocks from the plurality of data blocks according to all data blocks in each data block groups have a common part.